
Snowpark integration: Add support for uncollected dataframes in st.map and improve st.table data collection #5590

Merged
merged 8 commits into from Oct 25, 2022

Conversation

sfc-gh-tszerszen
Contributor

@sfc-gh-tszerszen sfc-gh-tszerszen commented Oct 23, 2022

📚 Context

This PR improves the Snowpark integration in Streamlit. It introduces the following changes:

  • Adds support for uncollected Snowpark dataframes and tables in the st.map component
  • Adds support for auto-collecting Snowpark tables (snowflake.snowpark.table.Table) in Streamlit components; previously, only unevaluated DataFrames were collected
  • Limits auto-collection for st.table to 100 rows instead of 10k. Collecting 10k rows in st.table was buggy: the page took a long time to load, the app sometimes crashed, and the rendered table made the page extremely long (discussed with @jrieke)
What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Refactoring
  • Other, please describe:
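The row-capping behavior described above can be sketched in isolation. This is an illustrative sketch, not the actual Streamlit implementation; `_FakeSnowparkDF` is a hypothetical stand-in for `snowflake.snowpark.DataFrame`, which does expose `.limit()` and `.to_pandas()`:

```python
MAX_UNEVALUATED_DF_ROWS = 10_000  # cap used by most elements
MAX_UNEVALUATED_TABLE_ROWS = 100  # tighter cap for st.table, per this PR


class _FakeSnowparkDF:
    """Hypothetical stand-in for snowflake.snowpark.DataFrame."""

    def __init__(self, rows):
        self._rows = rows

    def limit(self, n):
        # Real Snowpark pushes the LIMIT into the Snowflake query, so at
        # most n rows are ever transferred to the client.
        return _FakeSnowparkDF(self._rows[:n])

    def to_pandas(self):
        # Real Snowpark returns a pandas DataFrame; a plain list keeps
        # this sketch dependency-free.
        return self._rows


def collect_capped(df, max_rows):
    """Materialize at most max_rows rows of an unevaluated dataframe."""
    return df.limit(max_rows).to_pandas()


rows = collect_capped(_FakeSnowparkDF(list(range(1_000))), MAX_UNEVALUATED_TABLE_ROWS)
```

Capping via `.limit()` before collection keeps both the data transfer and the rendered page size bounded, which is the motivation for the 100-row cap above.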

🧠 Description of Changes

  • Add bullet points summarizing your changes here

    • This is a breaking API change
    • This is a visible (user-facing) change

Revised:

Insert screenshot of your updated UI/code here

Current:

Insert screenshot of existing UI/code here

🧪 Testing Done

  • Screenshots included
  • Added/Updated unit tests
  • Added/Updated e2e tests

🌐 References

Does this depend on other work, documents, or tickets?

  • Issue: Closes #XXXX

Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

@sfc-gh-tszerszen sfc-gh-tszerszen self-assigned this Oct 23, 2022
@sfc-gh-tszerszen sfc-gh-tszerszen changed the title Snowpark integration: Add support for snowflake.snowpark.table.Table Snowpark integration Oct 23, 2022
@sfc-gh-tszerszen sfc-gh-tszerszen changed the title Snowpark integration Further improvements int Snowpark integration Oct 23, 2022
@sfc-gh-tszerszen sfc-gh-tszerszen changed the title Further improvements int Snowpark integration Further improvements of Snowpark integration Oct 23, 2022
@sfc-gh-tszerszen sfc-gh-tszerszen force-pushed the add-support-for-snowpark branch 4 times, most recently from 3c17c77 to 5548d42 Compare October 24, 2022 16:38
@sfc-gh-tszerszen sfc-gh-tszerszen changed the title Further improvements of Snowpark integration Snowpark integration: Add support for uncollected dataframes in st.map and improve st.table data collection Oct 24, 2022
@sfc-gh-tszerszen sfc-gh-tszerszen marked this pull request as ready for review October 24, 2022 17:26
return json.dumps(_DEFAULT_MAP)

if hasattr(data, "empty"):
Collaborator


I think this can be simplified a bit:

if hasattr(data, "empty") and data.empty:
    return json.dumps(_DEFAULT_MAP)

And I think we can remove the ignore and the comment from harahu.

Contributor Author


OK, so the version with `and` requires `# type: ignore`, which makes @harahu's comment completely up to date and valid, I think 🤔

Also, you meant `and`, right? Not the `&` operator?

I've updated the PR with a valid example.

Collaborator


Ahh yes, I meant `and`. OK, in that case we can leave the comment. I thought that `hasattr(data, "empty")` would give enough hints to the type checker for this 🤔
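The check under discussion can be sketched standalone. Names like `_DEFAULT_MAP` follow the diff above, but the placeholder value and the `_FakeDF` helper here are hypothetical:

```python
import json

_DEFAULT_MAP = {"initialViewState": {}}  # placeholder; the real default lives in map.py


def default_if_empty(data):
    # hasattr() guards inputs that are not dataframe-like; checking
    # data.empty in the same expression keeps the early return on one path.
    # Under mypy, the attribute access may still need a `# type: ignore`,
    # because hasattr() alone does not narrow the static type of `data`.
    if hasattr(data, "empty") and data.empty:
        return json.dumps(_DEFAULT_MAP)
    return None


class _FakeDF:
    """Hypothetical object mimicking an empty pandas DataFrame."""
    empty = True


result = default_if_empty(_FakeDF())
```

A plain list has no `empty` attribute, so it falls through the guard without raising.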

Collaborator

@LukasMasuch LukasMasuch left a comment


Overall looks good 👍 I added a few comments

@@ -186,7 +194,7 @@ def to_deckgl_json(data: Data, zoom: Optional[int]) -> str:
# the empty attribute. This is either a bug, or the documented data type
# is too broad. One or the other should be addressed, and the ignore
# statement removed.
if data[lon].isnull().values.any() or data[lat].isnull().values.any(): # type: ignore[index]
if data[lon].isnull().values.any() or data[lat].isnull().values.any():
Collaborator


nit: If we can remove the ignore statement here, we can also remove the comment from harahu.

Contributor Author


It's removed 👍

if df.shape[0] == MAX_UNEVALUATED_DF_ROWS:
if (
is_type(df, _SNOWPARK_DF_TYPE_STR)
and not isinstance(df, list)
Collaborator


nit: I think the `and not isinstance(df, list)` is not really necessary here, since if it is of type `_SNOWPARK_DF_TYPE_STR` or `_SNOWPARK_TABLE_TYPE_STR` it cannot be a list anyway. An alternative might be something like:

is_snowpark_dataframe(df) and not isinstance(df, list)

Contributor Author


It's updated now 👍

st.caption(
f"⚠️ Showing only 10k rows. Call `collect()` on the dataframe to show more."
f"⚠️ Showing only {'10k' if max_unevaluated_rows == MAX_UNEVALUATED_DF_ROWS else str(max_unevaluated_rows)} rows. "
Collaborator


nit: at this point, it might be worth adding a function to string_util that automatically does this, e.g.:

def simplify_number(num: int) -> str:
    num_converted = float("{:.2g}".format(num))
    magnitude = 0
    while abs(num_converted) >= 1000:
        magnitude += 1
        num_converted /= 1000.0
    return "{}{}".format(
        "{:f}".format(num_converted).rstrip("0").rstrip("."),
        ["", "k", "m", "b", "t"][magnitude],
    )
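For reference, the suggested helper behaves like this (the definition is repeated so the snippet runs standalone; outputs verified for a few spot-check inputs):

```python
def simplify_number(num: int) -> str:
    num_converted = float("{:.2g}".format(num))  # round to 2 significant digits
    magnitude = 0
    while abs(num_converted) >= 1000:
        magnitude += 1
        num_converted /= 1000.0
    return "{}{}".format(
        "{:f}".format(num_converted).rstrip("0").rstrip("."),
        ["", "k", "m", "b", "t"][magnitude],
    )


print(simplify_number(100))        # -> 100
print(simplify_number(10_000))     # -> 10k
print(simplify_number(1_500_000))  # -> 1.5m
```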

Contributor Author


Your function has been added to string_util and unit tests are added 👍 Thank you 🙇

if data.empty: # type: ignore[union-attr]
return json.dumps(_DEFAULT_MAP)

if type_util.is_snowpark_dataframe(data):
Collaborator


I think we actually can just always call convert_anything_to_df here without the is_snowpark_dataframe check. I don't think there is any reason why map should not have the same support for dataframe-like input types as our other elements. In that case, the data = pd.DataFrame(data) in line 200 would need to be removed and the docstring would need to be updated to cover the same types as st.dataframe.

Contributor Author


Now we always call convert_anything_to_df 👍

@@ -241,9 +242,11 @@ def is_dataframe_like(obj: object) -> TypeGuard[DataFrameLike]:


def is_snowpark_dataframe(obj: object) -> bool:
Collaborator


nit: maybe it is better to rename this to be a bit more generic - e.g. something like `is_snowpark_data_object` - since it is no longer just about the actual Snowpark dataframe.

Contributor Author


It's renamed 👍
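A minimal sketch of what the renamed helper might look like. This is a hypothetical implementation; the real Streamlit code compares fully-qualified type names via its `is_type` helper, which avoids importing snowflake-snowpark-python:

```python
def is_snowpark_data_object(obj: object) -> bool:
    """True if obj is a Snowpark DataFrame or Table.

    Checked by fully-qualified type name, so the snowpark package does
    not need to be installed (or imported) for this check to run.
    """
    t = type(obj)
    full_name = f"{t.__module__}.{t.__qualname__}"
    return full_name in (
        "snowflake.snowpark.dataframe.DataFrame",
        "snowflake.snowpark.table.Table",
    )
```

The duck-typed name check means ordinary objects (lists, strings, pandas DataFrames) simply return False.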

@sfc-gh-tszerszen
Contributor Author

@LukasMasuch thank you for your review 🙇 I've updated the code accordingly 👍

Collaborator

@LukasMasuch LukasMasuch left a comment


LGTM 👍

"""Try to convert different formats to a Pandas Dataframe.

Parameters
----------
df : ndarray, Iterable, dict, DataFrame, Styler, pa.Table, None, dict, list, or any

max_unevaluated_rows: int, If unevaluated data is detected this func will evaluate it,
Collaborator


nit: I think you need to put the description here on a new line.

st.caption(
f"⚠️ Showing only 10k rows. Call `collect()` on the dataframe to show more."
f"⚠️ Showing only {string_util.simplify_number(max_unevaluated_rows)} rows. "
f"Call `collect()` on the dataframe to show more."
Collaborator


nit: I think you can remove the `f` prefix in the second line (since there are no variables to interpolate)
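This works because Python joins adjacent string literals at compile time; only the literal that interpolates a value needs the `f` prefix. A standalone illustration (the value of `simplified` is hypothetical):

```python
simplified = "100"  # stand-in for string_util.simplify_number(max_unevaluated_rows)

caption = (
    f"⚠️ Showing only {simplified} rows. "
    "Call `collect()` on the dataframe to show more."  # plain literal: no f needed
)
```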

@sfc-gh-tszerszen sfc-gh-tszerszen merged commit 082a7ba into develop Oct 25, 2022
@sfc-gh-kmcgrady sfc-gh-kmcgrady deleted the add-support-for-snowpark branch October 5, 2023 19:29